Data Analysis and Plotting in Python with Pandas¶

Andreas Herten, Jülich Supercomputing Centre, Forschungszentrum Jülich, 2 June 2021

My Motivation¶

  • I like Python
  • I like plotting data
  • I like sharing
  • I think Pandas is awesome and you should use it too
  • …but I'm no Python expert!

Motto: »Pandas as early as possible!«

Task Outline¶

  • Task 1
  • Task 2
  • Task 3
  • Task 4
  • Task 5
  • Task 6
  • Task 7
  • Task 7B

Tutorial Setup¶

  • 3½ hours, including a break around 10:30
  • Alternating between lecture and hands-on
  • Please give the status of the hands-on tasks via 👍 as BigBlueButton status
  • Please now open Jupyter Notebook of this session: https://go.fzj.de/jsc-pd21
  • Give thumbs up! 👍

About Pandas¶

  • Python package (Python 3; older versions also supported Python 2)
  • For data analysis and manipulation
  • With data structures (multi-dimensional table; time series), operations
  • Name from »Panel Data« (multi-dimensional time series in economics)
  • Since 2008
  • https://pandas.pydata.org/
  • Install via PyPI: pip install pandas
  • Cheatsheet: https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

Pandas Cohabitation¶

  • Pandas works great together with other established Python tools
    • Jupyter Notebooks
    • Plotting with matplotlib
    • Numerical analysis with numpy
    • Modelling with statsmodels, scikit-learn
    • Nicer plots with seaborn, altair, plotly
    • Performance enhancement with Cython, Numba, …
  • Tools building on Pandas: cuDF (GPU-accelerated DataFrames in RAPIDS), …

First Steps¶

In [1]:
import pandas
In [2]:
import pandas as pd
In [3]:
pd.__version__
Out[3]:
'1.2.4'
In [4]:
%pdoc pd
Class docstring:
    pandas - a powerful data analysis and manipulation library for Python
    =====================================================================
    
    **pandas** is a Python package providing fast, flexible, and expressive data
    structures designed to make working with "relational" or "labeled" data both
    easy and intuitive. It aims to be the fundamental high-level building block for
    doing practical, **real world** data analysis in Python. Additionally, it has
    the broader goal of becoming **the most powerful and flexible open source data
    analysis / manipulation tool available in any language**. It is already well on
    its way toward this goal.
    
    Main Features
    -------------
    Here are just a few of the things that pandas does well:
    
      - Easy handling of missing data in floating point as well as non-floating
        point data.
      - Size mutability: columns can be inserted and deleted from DataFrame and
        higher dimensional objects
      - Automatic and explicit data alignment: objects can be explicitly aligned
        to a set of labels, or the user can simply ignore the labels and let
        `Series`, `DataFrame`, etc. automatically align the data for you in
        computations.
      - Powerful, flexible group by functionality to perform split-apply-combine
        operations on data sets, for both aggregating and transforming data.
      - Make it easy to convert ragged, differently-indexed data in other Python
        and NumPy data structures into DataFrame objects.
      - Intelligent label-based slicing, fancy indexing, and subsetting of large
        data sets.
      - Intuitive merging and joining data sets.
      - Flexible reshaping and pivoting of data sets.
      - Hierarchical labeling of axes (possible to have multiple labels per tick).
      - Robust IO tools for loading data from flat files (CSV and delimited),
        Excel files, databases, and saving/loading data from the ultrafast HDF5
        format.
      - Time series-specific functionality: date range generation and frequency
        conversion, moving window statistics, date shifting and lagging.

DataFrames¶

It's all about DataFrames¶

  • Data containers of Pandas:
    • Linear: Series
    • Multi Dimension: DataFrame
  • A Series is just the special (1D) case of a DataFrame
  • → We use DataFrames as the more general case here
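A quick check of that relationship (a minimal sketch): selecting one column from a DataFrame gives back a Series.

```python
import pandas as pd

# A Series is a single labeled column; slicing one column
# out of a DataFrame returns a Series with the same data.
s = pd.Series([41, 56, 38], name="Age")
df = pd.DataFrame({"Age": [41, 56, 38]})

print(type(df["Age"]).__name__)  # Series
print(df["Age"].equals(s))       # True
```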

DataFrames¶

Construction¶

  • To show the features of a DataFrame, let's construct one and explore it by example!
  • Many construction possibilities
    • From lists, dictionaries, numpy objects
    • From CSV, HDF5, JSON, Excel, HTML, fixed-width files
    • From pickled Pandas data
    • From clipboard
    • From Feather, Parquet, SAS, SQL, Google BigQuery, Stata

DataFrames¶

Examples, finally¶

In [5]:
ages  = [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]
In [6]:
pd.DataFrame(ages)
Out[6]:
0
0 41
1 56
2 56
3 57
4 39
5 59
6 43
7 56
8 38
9 60
In [7]:
df_ages = pd.DataFrame(ages)
df_ages.head(3)
Out[7]:
0
0 41
1 56
2 56
  • Let's add names to ages; put everything into a dict()
In [8]:
data = {
    "Name": ["Liu", "Rowland", "Rivers", "Waters", "Rice", "Fields", "Kerr", "Romero", "Davis", "Hall"],
    "Age": ages
}
print(data)
{'Name': ['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr', 'Romero', 'Davis', 'Hall'], 'Age': [41, 56, 56, 57, 39, 59, 43, 56, 38, 60]}
In [9]:
df_sample = pd.DataFrame(data)
df_sample.head(4)
Out[9]:
Name Age
0 Liu 41
1 Rowland 56
2 Rivers 56
3 Waters 57
  • Columns are created automatically from the dictionary keys
  • Two columns now; one for names, one for ages
In [10]:
df_sample.columns
Out[10]:
Index(['Name', 'Age'], dtype='object')
  • The first column shown is the index
  • A DataFrame always has an index; auto-generated or custom
In [11]:
df_sample.index
Out[11]:
RangeIndex(start=0, stop=10, step=1)
  • Make Name the index with .set_index()
  • inplace=True will modify the parent frame (I don't like it)
In [12]:
df_sample.set_index("Name", inplace=True)
df_sample
Out[12]:
Age
Name
Liu 41
Rowland 56
Rivers 56
Waters 57
Rice 39
Fields 59
Kerr 43
Romero 56
Davis 38
Hall 60
  • Some more operations
In [13]:
df_sample.describe()
Out[13]:
Age
count 10.000000
mean 50.500000
std 9.009255
min 38.000000
25% 41.500000
50% 56.000000
75% 56.750000
max 60.000000
In [14]:
df_sample.T
Out[14]:
Name Liu Rowland Rivers Waters Rice Fields Kerr Romero Davis Hall
Age 41 56 56 57 39 59 43 56 38 60
In [15]:
df_sample.T.columns
Out[15]:
Index(['Liu', 'Rowland', 'Rivers', 'Waters', 'Rice', 'Fields', 'Kerr',
       'Romero', 'Davis', 'Hall'],
      dtype='object', name='Name')
  • Also: Arithmetic operations
In [16]:
df_sample.multiply(2).head(3)
Out[16]:
Age
Name
Liu 82
Rowland 112
Rivers 112
In [17]:
df_sample.reset_index().multiply(2).head(3)
Out[17]:
Name Age
0 LiuLiu 82
1 RowlandRowland 112
2 RiversRivers 112
In [18]:
(df_sample / 2).head(3)
Out[18]:
Age
Name
Liu 20.5
Rowland 28.0
Rivers 28.0
In [19]:
(df_sample * df_sample).head(3)
Out[19]:
Age
Name
Liu 1681
Rowland 3136
Rivers 3136
In [20]:
def mysquare(number: float) -> float:
    return number*number

df_sample.apply(mysquare).head()
# or: df_sample.apply(lambda x: x*x).head()
Out[20]:
Age
Name
Liu 1681
Rowland 3136
Rivers 3136
Waters 3249
Rice 1521
In [21]:
import numpy as np
In [22]:
df_sample.apply(np.square).head()
Out[22]:
Age
Name
Liu 1681
Rowland 3136
Rivers 3136
Waters 3249
Rice 1521

Logical operations are allowed as well

In [23]:
df_sample > 40
Out[23]:
Age
Name
Liu True
Rowland True
Rivers True
Waters True
Rice False
Fields True
Kerr True
Romero True
Davis False
Hall True
In [24]:
df_sample.apply(mysquare).head() == df_sample.apply(lambda x: x*x).head()
Out[24]:
Age
Name
Liu True
Rowland True
Rivers True
Waters True
Rice True

Task 1¶

TASK

  • Create data frame with
    • 6 names of dinosaurs,
    • their favourite prime number,
    • and their favorite color.
  • Play around with the frame
  • Tell me when you're done with status icon in BigBlueButton: 👍
In [25]:
happy_dinos = {
    "Dinosaur Name": [],
    "Favourite Prime": [],
    "Favourite Color": []
}
#df_dinos = 
In [26]:
happy_dinos = {
    "Dinosaur Name": ["Aegyptosaurus", "Tyrannosaurus", "Panoplosaurus", "Isisaurus", "Triceratops", "Velociraptor"],
    "Favourite Prime": ["4", "8", "15", "16", "23", "42"],
    "Favourite Color": ["blue", "white", "blue", "purple", "violet", "gray"]
}
df_dinos = pd.DataFrame(happy_dinos).set_index("Dinosaur Name")
df_dinos.T
Out[26]:
Dinosaur Name Aegyptosaurus Tyrannosaurus Panoplosaurus Isisaurus Triceratops Velociraptor
Favourite Prime 4 8 15 16 23 42
Favourite Color blue white blue purple violet gray

More DataFrame examples¶

In [27]:
df_demo = pd.DataFrame({
    "A": 1.2,
    "B": pd.Timestamp('20180226'),
    "C": [(-1)**i * np.sqrt(i) + np.e * (-1)**(i-1) for i in range(5)],
    "D": pd.Categorical(["This", "column", "has", "entries", "entries"]),
    "E": "Same"
})
df_demo
Out[27]:
A B C D E
0 1.2 2018-02-26 -2.718282 This Same
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
3 1.2 2018-02-26 0.986231 entries Same
4 1.2 2018-02-26 -0.718282 entries Same
In [28]:
df_demo.sort_values("C")
Out[28]:
A B C D E
0 1.2 2018-02-26 -2.718282 This Same
2 1.2 2018-02-26 -1.304068 has Same
4 1.2 2018-02-26 -0.718282 entries Same
3 1.2 2018-02-26 0.986231 entries Same
1 1.2 2018-02-26 1.718282 column Same
In [29]:
df_demo.round(2).tail(2)
Out[29]:
A B C D E
3 1.2 2018-02-26 0.99 entries Same
4 1.2 2018-02-26 -0.72 entries Same
In [30]:
df_demo.round(2).sum()
Out[30]:
A                     6.0
C                   -2.03
E    SameSameSameSameSame
dtype: object
In [31]:
print(df_demo.round(2).to_latex())
\begin{tabular}{lrlrll}
\toprule
{} &    A &          B &     C &        D &     E \\
\midrule
0 &  1.2 & 2018-02-26 & -2.72 &     This &  Same \\
1 &  1.2 & 2018-02-26 &  1.72 &   column &  Same \\
2 &  1.2 & 2018-02-26 & -1.30 &      has &  Same \\
3 &  1.2 & 2018-02-26 &  0.99 &  entries &  Same \\
4 &  1.2 & 2018-02-26 & -0.72 &  entries &  Same \\
\bottomrule
\end{tabular}

Reading External Data¶

(Links to documentation)

  • .read_json()
  • .read_csv()
  • .read_hdf()
  • .read_excel()

Example:

{
    "Character": ["Sawyer", "…", "Walt"],
    "Actor": ["Josh Holloway", "…", "Malcolm David Kelley"],
    "Main Cast": [true,  "…", false]
}
In [32]:
pd.read_json("data-lost.json").set_index("Character").sort_index()
Out[32]:
Actor Main Cast
Character
Hurley Jorge Garcia True
Jack Matthew Fox True
Kate Evangeline Lilly True
Locke Terry O'Quinn True
Sawyer Josh Holloway True
Walt Malcolm David Kelley False

Task 2¶

TASK

  • Read in data-nest.csv to DataFrame; call it df
    (Data was produced with JUBE)
  • Get to know it and play a bit with it
  • Tell me when you're done with status icon in BigBlueButton: 👍
In [33]:
!head data-nest.csv
id,Nodes,Tasks/Node,Threads/Task,Runtime Program / s,Scale,Plastic,Avg. Neuron Build Time / s,Min. Edge Build Time / s,Max. Edge Build Time / s,Min. Init. Time / s,Max. Init. Time / s,Presim. Time / s,Sim. Time / s,Virt. Memory (Sum) / kB,Local Spike Counter (Sum),Average Rate (Sum),Number of Neurons,Number of Connections,Min. Delay,Max. Delay
5,1,2,4,420.42,10,true,0.29,88.12,88.18,1.14,1.20,17.26,311.52,46560664.00,825499,7.48,112500,1265738500,1.5,1.5
5,1,4,4,200.84,10,true,0.15,46.03,46.34,0.70,1.01,7.87,142.97,46903088.00,802865,7.03,112500,1265738500,1.5,1.5
5,1,2,8,202.15,10,true,0.28,47.98,48.48,0.70,1.20,7.95,142.81,47699384.00,802865,7.03,112500,1265738500,1.5,1.5
5,1,4,8,89.57,10,true,0.15,20.41,23.21,0.23,3.04,3.19,60.31,46813040.00,821491,7.23,112500,1265738500,1.5,1.5
5,2,2,4,164.16,10,true,0.20,40.03,41.09,0.52,1.58,6.08,114.88,46937216.00,802865,7.03,112500,1265738500,1.5,1.5
5,2,4,4,77.68,10,true,0.13,20.93,21.22,0.16,0.46,3.12,52.05,47362064.00,821491,7.23,112500,1265738500,1.5,1.5
5,2,2,8,79.60,10,true,0.20,21.63,21.91,0.19,0.47,2.98,53.12,46847168.00,821491,7.23,112500,1265738500,1.5,1.5
5,2,4,8,37.20,10,true,0.13,10.08,11.60,0.10,1.63,1.24,23.29,47065232.00,818198,7.33,112500,1265738500,1.5,1.5
5,3,2,4,96.51,10,true,0.15,26.54,27.41,0.36,1.22,3.33,64.28,52256880.00,813743,7.27,112500,1265738500,1.5,1.5
In [34]:
df = pd.read_csv("data-nest.csv")
df.head()
Out[34]:
id Nodes Tasks/Node Threads/Task Runtime Program / s Scale Plastic Avg. Neuron Build Time / s Min. Edge Build Time / s Max. Edge Build Time / s ... Max. Init. Time / s Presim. Time / s Sim. Time / s Virt. Memory (Sum) / kB Local Spike Counter (Sum) Average Rate (Sum) Number of Neurons Number of Connections Min. Delay Max. Delay
0 5 1 2 4 420.42 10 True 0.29 88.12 88.18 ... 1.20 17.26 311.52 46560664.0 825499 7.48 112500 1265738500 1.5 1.5
1 5 1 4 4 200.84 10 True 0.15 46.03 46.34 ... 1.01 7.87 142.97 46903088.0 802865 7.03 112500 1265738500 1.5 1.5
2 5 1 2 8 202.15 10 True 0.28 47.98 48.48 ... 1.20 7.95 142.81 47699384.0 802865 7.03 112500 1265738500 1.5 1.5
3 5 1 4 8 89.57 10 True 0.15 20.41 23.21 ... 3.04 3.19 60.31 46813040.0 821491 7.23 112500 1265738500 1.5 1.5
4 5 2 2 4 164.16 10 True 0.20 40.03 41.09 ... 1.58 6.08 114.88 46937216.0 802865 7.03 112500 1265738500 1.5 1.5

5 rows × 21 columns

Read CSV Options¶

  • See also full API documentation
  • Important parameters
    • sep: Set separator (for example : instead of ,)
    • header: Specify info about headers for columns; able to use multi-index for columns!
    • names: Alternative to header – provide your own column titles
    • usecols: Don't read the whole set of columns, but only these; works with any list-like (e.g. range(0, 20, 2))…
    • skiprows: Don't read in these rows
    • na_values: What string(s) to recognize as N/A values (which will be ignored during operations on data frame)
    • parse_dates: Try to parse dates in the CSV; behaviour depends on the data structure provided; optionally used together with date_parser
    • compression: Treat input file as compressed file ("infer", "gzip", "zip", …)
    • decimal: Decimal separator – e.g. , for German data…
pandas.read_csv(filepath_or_buffer, sep=<object object>, delimiter=None, header='infer', names=None, index_col=None, usecols=None, squeeze=False, prefix=None, mangle_dupe_cols=True, dtype=None, engine=None, converters=None, true_values=None, false_values=None, skipinitialspace=False, skiprows=None, skipfooter=0, nrows=None, na_values=None, keep_default_na=True, na_filter=True, verbose=False, skip_blank_lines=True, parse_dates=False, infer_datetime_format=False, keep_date_col=False, date_parser=None, dayfirst=False, cache_dates=True, iterator=False, chunksize=None, compression='infer', thousands=None, decimal='.', lineterminator=None, quotechar='"', quoting=0, doublequote=True, escapechar=None, comment=None, encoding=None, dialect=None, error_bad_lines=True, warn_bad_lines=True, delim_whitespace=False, low_memory=True, memory_map=False, float_precision=None, storage_options=None)
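A small, self-contained sketch combining a few of these options (the file content here is made up and read from a string buffer instead of a file):

```python
import io
import pandas as pd

# Hypothetical semicolon-separated file with German decimal
# commas and "n/a" as the missing-value marker
raw = io.StringIO(
    "Name;Datum;Wert\n"
    "A;2021-06-02;1,5\n"
    "B;2021-06-03;n/a\n"
)
df = pd.read_csv(raw, sep=";", decimal=",",
                 na_values=["n/a"], parse_dates=["Datum"])
print(df.dtypes)  # Wert becomes float64 (with NaN), Datum datetime64[ns]
```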

Slicing of Data Frames¶

  • Slicing: Select a sub-range / sub-set of entire data frame
  • Pandas documentation: Detailed documentation, short documentation

Quick Slices¶

  • Use square-bracket operators to slice data frame quickly: []
    • Use column name to select column
    • Use numerical value to select row
  • Example: Select only column C from df_demo
In [35]:
df_demo.head(3)
Out[35]:
A B C D E
0 1.2 2018-02-26 -2.718282 This Same
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
In [36]:
df_demo['C']
Out[36]:
0   -2.718282
1    1.718282
2   -1.304068
3    0.986231
4   -0.718282
Name: C, dtype: float64
  • Instead of the column name in quotes and square brackets: use the column name directly as an attribute
In [37]:
df_demo.C
Out[37]:
0   -2.718282
1    1.718282
2   -1.304068
3    0.986231
4   -0.718282
Name: C, dtype: float64
  • I'm not a fan, because no spaces are allowed
    (And »Pandas as early as possible« means labelling columns well, including spaces)
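A tiny illustration of why only the bracket operator works for such labels (column name borrowed from the Nest data):

```python
import pandas as pd

# Well-labelled columns, with spaces and units, can only be
# reached via the bracket operator
df = pd.DataFrame({"Sim. Time / s": [311.52, 142.97]})
print(df["Sim. Time / s"].max())  # 311.52
# df.Sim. Time / s  → not even valid Python syntax
```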
  • Select more than one column by providing list to slice operator []
  • Example: Select list of columns A and C, ['A', 'C'] from df_demo
In [38]:
my_slice = ['A', 'C']
df_demo[my_slice]
Out[38]:
A C
0 1.2 -2.718282
1 1.2 1.718282
2 1.2 -1.304068
3 1.2 0.986231
4 1.2 -0.718282
  • Use numerical values in brackets to slice along rows
  • Use ranges just like with Python lists
In [39]:
df_demo[1:3]
Out[39]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
In [40]:
df_demo[1:6:2]
Out[40]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
3 1.2 2018-02-26 0.986231 entries Same
  • Attention: location might change after re-sorting!
In [41]:
df_demo[1:3]
Out[41]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
In [42]:
df_demo.sort_values("C")[1:3]
Out[42]:
A B C D E
2 1.2 2018-02-26 -1.304068 has Same
4 1.2 2018-02-26 -0.718282 entries Same

Slicing of Data Frames¶

Better Slicing¶

  • .iloc[] and .loc[]: Faster slicing interfaces with more options
In [43]:
df_demo.iloc[1:3]
Out[43]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
  • Also slice along columns (second argument)
In [44]:
df_demo.iloc[1:3, [0, 2]]
Out[44]:
A C
1 1.2 1.718282
2 1.2 -1.304068
  • .iloc[]: Slice by position (numerical/integer)
  • .loc[]: Slice by label (named)
  • See difference with a proper index (and not the auto-generated default index from before)
In [45]:
df_demo_indexed = df_demo.set_index("D")
df_demo_indexed
Out[45]:
A B C E
D
This 1.2 2018-02-26 -2.718282 Same
column 1.2 2018-02-26 1.718282 Same
has 1.2 2018-02-26 -1.304068 Same
entries 1.2 2018-02-26 0.986231 Same
entries 1.2 2018-02-26 -0.718282 Same
In [46]:
df_demo_indexed.loc["entries"]
Out[46]:
A B C E
D
entries 1.2 2018-02-26 0.986231 Same
entries 1.2 2018-02-26 -0.718282 Same
In [47]:
df_demo_indexed.loc[["has", "entries"], ["A", "C"]]
Out[47]:
A C
D
has 1.2 -1.304068
entries 1.2 0.986231
entries 1.2 -0.718282

Slicing of Data Frames¶

Advanced Slicing: Logical Slicing¶

  • A slice can also be an array of booleans
In [48]:
df_demo[df_demo["C"] > 0]
Out[48]:
A B C D E
1 1.2 2018-02-26 1.718282 column Same
3 1.2 2018-02-26 0.986231 entries Same
In [49]:
df_demo["C"] > 0
Out[49]:
0    False
1     True
2    False
3     True
4    False
Name: C, dtype: bool
In [50]:
df_demo[(df_demo["C"] < 0) & (df_demo["D"] == "entries")]
Out[50]:
A B C D E
4 1.2 2018-02-26 -0.718282 entries Same

Adding to Existing Data Frame¶

  • Add new columns with frame["new col"] = something or .insert()
  • Add new rows with frame.append()
  • Combine data frames
    • Concat: Combine several data frames along an axis
    • Merge: Combine data frames on basis of common columns; database-style
    • (Join)
    • See user guide on merging
In [51]:
df_demo.head(3)
Out[51]:
A B C D E
0 1.2 2018-02-26 -2.718282 This Same
1 1.2 2018-02-26 1.718282 column Same
2 1.2 2018-02-26 -1.304068 has Same
In [52]:
df_demo["F"] = df_demo["C"] - df_demo["A"]
df_demo.head(3)
Out[52]:
A B C D E F
0 1.2 2018-02-26 -2.718282 This Same -3.918282
1 1.2 2018-02-26 1.718282 column Same 0.518282
2 1.2 2018-02-26 -1.304068 has Same -2.504068
  • .insert() allows specifying the position of insertion
  • .shape gives the size of the data frame as a tuple: (rows, columns)
In [53]:
df_demo.insert(df_demo.shape[1] - 1, "E2", df_demo["C"] ** 2)
df_demo.head(3)
Out[53]:
A B C D E E2 F
0 1.2 2018-02-26 -2.718282 This Same 7.389056 -3.918282
1 1.2 2018-02-26 1.718282 column Same 2.952492 0.518282
2 1.2 2018-02-26 -1.304068 has Same 1.700594 -2.504068
In [54]:
df_demo.tail(3)
Out[54]:
A B C D E E2 F
2 1.2 2018-02-26 -1.304068 has Same 1.700594 -2.504068
3 1.2 2018-02-26 0.986231 entries Same 0.972652 -0.213769
4 1.2 2018-02-26 -0.718282 entries Same 0.515929 -1.918282
In [55]:
df_demo.append(
    {"A": 1.3, "B": pd.Timestamp("2018-02-27"), "C": -0.777, "D": "has it?", "E": "Same", "F": 23},
    ignore_index=True
)
Out[55]:
A B C D E E2 F
0 1.2 2018-02-26 -2.718282 This Same 7.389056 -3.918282
1 1.2 2018-02-26 1.718282 column Same 2.952492 0.518282
2 1.2 2018-02-26 -1.304068 has Same 1.700594 -2.504068
3 1.2 2018-02-26 0.986231 entries Same 0.972652 -0.213769
4 1.2 2018-02-26 -0.718282 entries Same 0.515929 -1.918282
5 1.3 2018-02-27 -0.777000 has it? Same NaN 23.000000

Combining Frames¶

  • First, create some simpler data frame to show .concat() and .merge()
In [56]:
df_1 = pd.DataFrame({"Key": ["First", "Second"], "Value": [1, 1]})
df_1
Out[56]:
Key Value
0 First 1
1 Second 1
In [57]:
df_2 = pd.DataFrame({"Key": ["First", "Second"], "Value": [2, 2]})
df_2
Out[57]:
Key Value
0 First 2
1 Second 2
  • Concatenate a list of data frames vertically (axis=0)
In [58]:
pd.concat([df_1, df_2])
Out[58]:
Key Value
0 First 1
1 Second 1
0 First 2
1 Second 2
  • Same, but re-index
In [59]:
pd.concat([df_1, df_2], ignore_index=True)
Out[59]:
Key Value
0 First 1
1 Second 1
2 First 2
3 Second 2
  • Concat, but horizontally
In [60]:
pd.concat([df_1, df_2], axis=1)
Out[60]:
Key Value Key Value
0 First 1 First 2
1 Second 1 Second 2
  • Merge on common column
In [61]:
pd.merge(df_1, df_2, on="Key")
Out[61]:
Key Value_x Value_y
0 First 1 2
1 Second 1 2
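.merge() defaults to an inner join; the how parameter controls what happens to unmatched keys. A minimal sketch with a third frame (df_3 is made up here, not part of the tutorial data):

```python
import pandas as pd

df_1 = pd.DataFrame({"Key": ["First", "Second"], "Value": [1, 1]})
df_3 = pd.DataFrame({"Key": ["Second", "Third"], "Value": [3, 3]})

# Inner join (default) keeps only keys present in both frames;
# an outer join keeps all keys and fills the gaps with NaN
inner = pd.merge(df_1, df_3, on="Key")
outer = pd.merge(df_1, df_3, on="Key", how="outer")
print(len(inner), len(outer))  # 1 3
```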

Task 3¶

TASK

  • Add a column to the Nest data frame from Task 2 called Threads which is the total number of threads across all nodes (i.e. the product of threads per task and tasks per node and nodes)
  • Tell me when you're done with status icon in BigBlueButton: 👍
In [62]:
df["Threads"] = df["Nodes"] * df["Tasks/Node"] * df["Threads/Task"]
df.head()
Out[62]:
id Nodes Tasks/Node Threads/Task Runtime Program / s Scale Plastic Avg. Neuron Build Time / s Min. Edge Build Time / s Max. Edge Build Time / s ... Presim. Time / s Sim. Time / s Virt. Memory (Sum) / kB Local Spike Counter (Sum) Average Rate (Sum) Number of Neurons Number of Connections Min. Delay Max. Delay Threads
0 5 1 2 4 420.42 10 True 0.29 88.12 88.18 ... 17.26 311.52 46560664.0 825499 7.48 112500 1265738500 1.5 1.5 8
1 5 1 4 4 200.84 10 True 0.15 46.03 46.34 ... 7.87 142.97 46903088.0 802865 7.03 112500 1265738500 1.5 1.5 16
2 5 1 2 8 202.15 10 True 0.28 47.98 48.48 ... 7.95 142.81 47699384.0 802865 7.03 112500 1265738500 1.5 1.5 16
3 5 1 4 8 89.57 10 True 0.15 20.41 23.21 ... 3.19 60.31 46813040.0 821491 7.23 112500 1265738500 1.5 1.5 32
4 5 2 2 4 164.16 10 True 0.20 40.03 41.09 ... 6.08 114.88 46937216.0 802865 7.03 112500 1265738500 1.5 1.5 16

5 rows × 22 columns

In [63]:
df.columns
Out[63]:
Index(['id', 'Nodes', 'Tasks/Node', 'Threads/Task', 'Runtime Program / s',
       'Scale', 'Plastic', 'Avg. Neuron Build Time / s',
       'Min. Edge Build Time / s', 'Max. Edge Build Time / s',
       'Min. Init. Time / s', 'Max. Init. Time / s', 'Presim. Time / s',
       'Sim. Time / s', 'Virt. Memory (Sum) / kB', 'Local Spike Counter (Sum)',
       'Average Rate (Sum)', 'Number of Neurons', 'Number of Connections',
       'Min. Delay', 'Max. Delay', 'Threads'],
      dtype='object')

Aside: Plotting without Pandas¶

Matplotlib 101¶

  • Matplotlib: de-facto standard for plotting in Python
  • Main interface: pyplot; provides MATLAB-like interface
  • Better: Use object-oriented API with Figure and Axis
  • Great integration into Jupyter Notebooks
  • Since v. 3: Only support for Python 3
  • → https://matplotlib.org/
In [64]:
import matplotlib.pyplot as plt
%matplotlib inline
In [65]:
x = np.linspace(0, 2*np.pi, 400)
y = np.sin(x**2)
In [66]:
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_title('Use like this')
ax.set_xlabel("Numbers");
ax.set_ylabel(r"$\sin(x^2)$");
  • Plot multiple lines into one canvas
  • Call ax.plot() multiple times
In [67]:
y2 = y/np.exp(y*1.5)
In [68]:
fig, ax = plt.subplots()
ax.plot(x, y, label="y")
ax.plot(x, y2, label="y2")
ax.legend()
ax.set_title("This plot makes no sense");
  • Matplotlib can also plot DataFrame data
  • Because DataFrame data is only array-like data with stuff on top
In [69]:
fig, ax = plt.subplots()
ax.plot(df_demo.index, df_demo["C"], label="C")
ax.legend()
ax.set_title("Nope, no sense at all");

Task 4¶

TASK

  • Sort the Nest data frame by threads
  • Plot "Presim. Time / s" and "Sim. Time / s" of our data frame df as a function of threads
  • Use a dashed, red line for "Presim. Time / s", a blue line for "Sim. Time / s" (see API description)
  • Don't forget to label your axes and to add a legend (1st rule of plotting)
  • Tell me when you're done with status icon in BigBlueButton: 👍
In [70]:
df.sort_values(["Threads", "Nodes", "Tasks/Node", "Threads/Task"], inplace=True)  # multi-level sort
In [71]:
fig, ax = plt.subplots()
ax.plot(df["Threads"], df["Presim. Time / s"], linestyle="dashed", color="red", label="Presim. Time / s")
ax.plot(df["Threads"], df["Sim. Time / s"], "-b", label="Sim. Time / s")
ax.set_xlabel("Threads")
ax.set_ylabel("Time / s")
ax.legend(loc='best');

Plotting with Pandas¶

  • Each data frame has a .plot() function (see API)
  • Plots with Matplotlib
  • Important API options:
    • kind: 'line' (default), 'bar[h]', 'hist', 'box', 'kde', 'scatter', 'hexbin'
    • subplots: Make a sub-plot for each column (good together with sharex, sharey)
    • figsize
    • grid: Add a grid to plot (use Matplotlib options)
    • style: Line style per column (accepts list or dict)
    • logx, logy, loglog: Logarithmic plots
    • xticks, yticks: Use values for ticks
    • xlim, ylim: Limits of axes
    • yerr, xerr: Add uncertainty to data points
    • stacked: Stack a bar plot
    • secondary_y: Use a secondary y axis for this plot
    • Labeling
      • title: Add title to plot (Use a list of strings if subplots=True)
      • legend: Add a legend
      • table: If true, add table of data under plot
    • **kwds: Non-parsed keyword passed to Matplotlib's plotting methods
  • Either slice and plot…
In [72]:
df_demo["C"].plot(figsize=(10, 2));
  • … or plot and select
In [73]:
df_demo.plot(y="C", figsize=(10, 2));
  • I prefer slicing first:
    → Allows for further operations on the sliced data frame
In [74]:
df_demo["C"].plot(kind="bar");
  • There are dedicated sub-methods for each of the plot kinds
  • I prefer to just call .plot(kind="smthng")
In [75]:
df_demo["C"].plot.bar();
In [76]:
df_demo["C"].plot(kind="bar", legend=True, figsize=(12, 4), ylim=(-1, 3), title="This is a C plot");

Task 5¶

TASK

Use the Nest data frame df to:

  1. Make threads index of the data frame (.set_index())
  2. Plot "Presim. Time / s" and "Sim. Time / s" individually
  3. Plot them onto one common canvas!
  4. Make them have the same line colors and styles as before
  5. Add a legend, add missing axes labels
  6. Tell me when you're done with status icon in BigBlueButton: 👍
In [77]:
df.set_index("Threads", inplace=True)
In [78]:
df["Presim. Time / s"].plot(figsize=(10, 3), style="--", color="red");
In [79]:
df["Sim. Time / s"].plot(figsize=(10, 3), style="-b");
In [80]:
df["Presim. Time / s"].plot(style="--r");
df["Sim. Time / s"].plot(style="-b");
In [81]:
ax = df[["Presim. Time / s", "Sim. Time / s"]].plot(style=["--r", "-b"]);
ax.set_ylabel("Time / s");

More Plotting with Pandas¶

Recap: Our first proper Pandas plot¶

In [82]:
df[["Presim. Time / s", "Sim. Time / s"]].plot();
  • That's why I think Pandas is great!
  • It has great defaults to quickly plot data; basically publication-grade already
  • Plotting functionality is very versatile
  • Before plotting, data can be massaged within data frames, if needed

More Plotting with Pandas¶

Some versatility¶

In [83]:
df_demo[["A", "C", "F"]].plot(kind="bar", stacked=True);
In [84]:
df_demo[df_demo["F"] < 0][["A", "C", "F"]].plot(kind="bar", stacked=True);
In [85]:
df_demo[df_demo["F"] < 0][["A", "C", "F"]]\
    .plot(kind="barh", subplots=True, sharex=True, title="Subplots Demo", figsize=(12, 4));
In [86]:
df_demo.loc[df_demo["F"] < 0, ["A", "F"]]\
    .plot(
        style=["-*r", "--ob"], 
        secondary_y="A", 
        figsize=(12, 6),
        table=True
    );
In [87]:
df_demo.loc[df_demo["F"] < 0, ["A", "F"]]\
    .plot(
        style=["-*r", "--ob"], 
        secondary_y="A", 
        figsize=(12, 6),
        yerr={
            "A": df_demo[df_demo["F"] < 0]["C"], 
            "F": 0.2
        }, 
        capsize=4,
        title="Bug: style is ignored with yerr",
        marker="P"
    );  

Combine Pandas with Matplotlib¶

  • Pandas shortcuts very handy
  • But sometimes, one needs to access underlying Matplotlib functionality
  • No problemo!
  • Option 1: Pandas always returns axis
    • Use this to manipulate the canvas
    • Get underlying figure with ax.get_figure() (for fig.savefig())
  • Option 2: Create figure and axes with Matplotlib, use when drawing
    • .plot(): Use ax option

Option 1: Pandas Returns Axis¶

In [88]:
ax = df_demo["C"].plot(figsize=(10, 4))
ax.set_title("Hello There!");
fig = ax.get_figure()
fig.suptitle("This title is super (literally)!");

Option 2: Draw on Matplotlib Axes¶

In [89]:
fig, ax = plt.subplots(figsize=(10, 4))
df_demo["C"].plot(ax=ax)
ax.set_title("Hello There!");
fig.suptitle("This title is super (still, literally)!");
  • We can also get fancy!
In [90]:
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True, figsize=(12, 4))
for ax, column, color in zip([ax1, ax2], ["C", "F"], ["blue", "#b2e123"]):
    df_demo[column].plot(ax=ax, legend=True, color=color)

Aside: Seaborn¶

  • Python package on top of Matplotlib
  • Powerful API shortcuts for plotting of statistical data
  • Manipulate color palettes
  • Works well together with Pandas
  • Also: New, good-looking defaults for Matplotlib (IMHO)
  • → https://seaborn.pydata.org/
In [91]:
import seaborn as sns
sns.set()  # set defaults
In [92]:
df_demo[["A", "C"]].plot();

Seaborn Color Palette Example¶

  • Documentation
In [93]:
sns.palplot(sns.color_palette())
In [94]:
sns.palplot(sns.color_palette("hls", 10))
In [95]:
sns.palplot(sns.color_palette("hsv", 20))
In [96]:
sns.palplot(sns.color_palette("Paired", 10))
In [97]:
sns.palplot(sns.color_palette("cubehelix", 8))
In [98]:
sns.palplot(sns.color_palette("colorblind", 10))

Seaborn Plot Examples¶

  • Most of the time, I use a regression plot from Seaborn
In [99]:
with sns.color_palette("hls", 2):
    sns.regplot(x="C", y="F", data=df_demo);
    sns.regplot(x="C", y="E2", data=df_demo);
  • A joint plot combines two plots relating to distribution of values into one
  • Very handy for showing a fuller picture of two-dimensionally scattered variables
In [100]:
x, y = np.random.multivariate_normal([0, 0], [[1, -.5], [-.5, 1]], size=300).T
In [101]:
sns.jointplot(x=x, y=y, kind="reg");

Task 6¶

TASK

  • To your df Nest data frame, add a column with the unaccounted time (Unaccounted Time / s): the program runtime minus average neuron build time, minimal edge build time, minimal initialization time, presimulation time, and simulation time.
    (I know this is technically not super correct, but it will do for our example.)
  • Plot a stacked bar plot of all these columns (except for program runtime) over the threads
  • Tell me when you're done with status icon in BigBlueButton: 👍
In [102]:
cols = [
    'Avg. Neuron Build Time / s', 
    'Min. Edge Build Time / s', 
    'Min. Init. Time / s', 
    'Presim. Time / s', 
    'Sim. Time / s'
]
df["Unaccounted Time / s"] = df['Runtime Program / s']
for entry in cols:
    df["Unaccounted Time / s"] = df["Unaccounted Time / s"] - df[entry]
In [103]:
df[["Runtime Program / s", "Unaccounted Time / s", *cols]].head(2)
Out[103]:
Runtime Program / s Unaccounted Time / s Avg. Neuron Build Time / s Min. Edge Build Time / s Min. Init. Time / s Presim. Time / s Sim. Time / s
Threads
8 420.42 2.09 0.29 88.12 1.14 17.26 311.52
16 202.15 2.43 0.28 47.98 0.70 7.95 142.81
In [104]:
df[["Unaccounted Time / s", *cols]].plot(kind="bar", stacked=True, figsize=(12, 4));
  • Make it relative to the total program run time
  • Slight complication: our thread counts as indexes are not unique; we need new unique indexes
  • Could be anything, but we use a multi index!
In [105]:
df_multind = df.set_index(["Nodes", "Tasks/Node", "Threads/Task"])
df_multind.head()
Out[105]:
id Runtime Program / s Scale Plastic Avg. Neuron Build Time / s Min. Edge Build Time / s Max. Edge Build Time / s Min. Init. Time / s Max. Init. Time / s Presim. Time / s Sim. Time / s Virt. Memory (Sum) / kB Local Spike Counter (Sum) Average Rate (Sum) Number of Neurons Number of Connections Min. Delay Max. Delay Unaccounted Time / s
Nodes Tasks/Node Threads/Task
1 2 4 5 420.42 10 True 0.29 88.12 88.18 1.14 1.20 17.26 311.52 46560664.0 825499 7.48 112500 1265738500 1.5 1.5 2.09
8 5 202.15 10 True 0.28 47.98 48.48 0.70 1.20 7.95 142.81 47699384.0 802865 7.03 112500 1265738500 1.5 1.5 2.43
4 4 5 200.84 10 True 0.15 46.03 46.34 0.70 1.01 7.87 142.97 46903088.0 802865 7.03 112500 1265738500 1.5 1.5 3.12
2 2 4 5 164.16 10 True 0.20 40.03 41.09 0.52 1.58 6.08 114.88 46937216.0 802865 7.03 112500 1265738500 1.5 1.5 2.45
1 2 12 6 141.70 10 True 0.30 32.93 33.26 0.62 0.95 5.41 100.16 50148824.0 813743 7.27 112500 1265738500 1.5 1.5 2.28
In [106]:
df_multind[["Unaccounted Time / s", *cols]]\
    .divide(df_multind["Runtime Program / s"], axis="index")\
    .plot(kind="bar", stacked=True, figsize=(14, 6), title="Relative Time Distribution");
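The `.divide(..., axis="index")` step is what makes the bars relative: each column is divided element-wise by the row's total runtime. A standalone sketch with invented numbers:

```python
import pandas as pd

df = pd.DataFrame(
    {"Sim. Time / s": [300.0, 150.0], "Presim. Time / s": [20.0, 10.0]},
    index=pd.Index([8, 16], name="Threads"),
)
total = pd.Series([400.0, 200.0], index=df.index, name="Runtime Program / s")

# Divide each column by the row's total runtime -> fractions of the runtime
rel = df.divide(total, axis="index")
print(rel)  # Sim. Time fraction 0.75 in both rows, Presim. fraction 0.05
```

With `axis="index"`, the division is aligned on the (row) index; the default `axis="columns"` would try to match the Series against the column labels instead.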

Next Level: Hierarchical Data¶

  • A MultiIndex is only the first level
  • More powerful:
    • Grouping: .groupby() ("Split-apply-combine", API, User Guide)
    • Pivoting: .pivot_table() (API, User Guide); also .pivot() (specialized version of .pivot_table(), API)
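A minimal split-apply-combine sketch with invented data: split the rows by a key column, apply an aggregation per group, and combine the results into a new frame.

```python
import pandas as pd

df = pd.DataFrame({
    "Nodes": [1, 1, 2, 2],
    "Sim. Time / s": [311.52, 305.0, 142.81, 150.0],
})

# Split by "Nodes", apply the mean per group, combine into one frame
means = df.groupby("Nodes").mean()
print(means)  # one row per distinct Nodes value
```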
In [107]:
df.groupby("Nodes").mean()
Out[107]:
id Tasks/Node Threads/Task Runtime Program / s Scale Plastic Avg. Neuron Build Time / s Min. Edge Build Time / s Max. Edge Build Time / s Min. Init. Time / s ... Presim. Time / s Sim. Time / s Virt. Memory (Sum) / kB Local Spike Counter (Sum) Average Rate (Sum) Number of Neurons Number of Connections Min. Delay Max. Delay Unaccounted Time / s
Nodes
1 5.333333 3.0 8.0 185.023333 10.0 True 0.220000 42.040000 42.838333 0.583333 ... 7.226667 132.061667 4.806585e+07 816298.000000 7.215000 112500.0 1.265738e+09 1.5 1.5 2.891667
2 5.333333 3.0 8.0 73.601667 10.0 True 0.168333 19.628333 20.313333 0.191667 ... 2.725000 48.901667 4.975288e+07 818151.000000 7.210000 112500.0 1.265738e+09 1.5 1.5 1.986667
3 5.333333 3.0 8.0 43.990000 10.0 True 0.138333 12.810000 13.305000 0.135000 ... 1.426667 27.735000 5.511165e+07 820465.666667 7.253333 112500.0 1.265738e+09 1.5 1.5 1.745000
4 5.333333 3.0 8.0 31.225000 10.0 True 0.116667 9.325000 9.740000 0.088333 ... 1.066667 19.353333 5.325783e+07 819558.166667 7.288333 112500.0 1.265738e+09 1.5 1.5 1.275000
5 5.333333 3.0 8.0 24.896667 10.0 True 0.140000 7.468333 7.790000 0.070000 ... 0.771667 14.950000 6.075634e+07 815307.666667 7.225000 112500.0 1.265738e+09 1.5 1.5 1.496667
6 5.333333 3.0 8.0 20.215000 10.0 True 0.106667 6.165000 6.406667 0.051667 ... 0.630000 12.271667 6.060652e+07 815456.333333 7.201667 112500.0 1.265738e+09 1.5 1.5 0.990000

6 rows × 21 columns

Pivoting¶

  • Combine categorically-similar columns
  • Creates hierarchical index
  • Respected during plotting with Pandas!
  • A pivot table has three components; if confused, think about the associated questions
    • index: »What's on the x axis?«
    • values: »What value do I want to plot [on the y axis]?«
    • columns: »What categories do I want [to be in the legend]?«
  • All three can be populated from the base data frame
  • Values might be aggregated, if needed (by default with the mean)
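A small sketch of that aggregation with made-up data: if several rows share the same index/column combination, `pivot_table()` averages them by default (`aggfunc="mean"`).

```python
import pandas as pd

# Two runs for Nodes=1, one for Nodes=2 (hypothetical numbers)
df = pd.DataFrame({
    "Nodes": [1, 1, 2],
    "Config": ["a", "a", "a"],
    "Sim. Time / s": [300.0, 310.0, 150.0],
})

pivot = df.pivot_table(index="Nodes", columns="Config", values="Sim. Time / s")
print(pivot)  # Nodes=1 -> 305.0 (mean of the two runs), Nodes=2 -> 150.0
```

This is also why the stricter `.pivot()` raises an error on duplicate combinations: it does not aggregate.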
In [108]:
df_demo["H"] = [(-1)**n for n in range(5)]
In [109]:
df_pivot = df_demo.pivot_table(
    index="F",
    values="E2",
    columns="H"
)
df_pivot
Out[109]:
H -1 1
F
-3.918282 NaN 7.389056
-2.504068 NaN 1.700594
-1.918282 NaN 0.515929
-0.213769 0.972652 NaN
0.518282 2.952492 NaN
In [110]:
df_pivot.plot();

Task 7¶

TASK

  • Create a pivot table based on the Nest df data frame
  • Let the x axis show the number of nodes; display the values of the simulation time (Sim. Time / s) for the different tasks-per-node and threads-per-task configurations
  • Plot it as a bar plot
  • Tell me when you're done with status icon in BigBlueButton: 👍
In [111]:
df.pivot_table(
    index="Nodes",
    columns=["Tasks/Node", "Threads/Task"],
    values="Sim. Time / s",
).plot(kind="bar", figsize=(12, 4));

Task 7B (like Bonus)¶

TASK

  • Same pivot table as before (that is, nodes on the x axis, and columns for Tasks/Node and Threads/Task)
  • But now, use Sim. Time / s and Presim. Time / s as the values to show
  • Show them as a stack of those two values inside the pivot table
  • Use Pandas' functionality as much as possible!

Impossible?

  • I gave up!
  • The person who does this best / first gets a personal certificate with my recommendation 😄

Conclusion¶

  • Pandas works with and on data frames, which are central
  • Slice frames to your liking
  • Plot frames
    • Together with Matplotlib, Seaborn, others
  • Pivot tables are next level greatness
  • Remember: Pandas as early as possible!
  • Thanks for being here! 😍

Feedback to a.herten@fz-juelich.de

Next slide: Further reading

Further Reading¶

  • Pandas User Guide
  • Matplotlib and LaTeX Plots
  • towardsdatascience.com:
    • Pandas DataFrame: A lightweight Intro
    • Introduction to Data Visualization in Python
    • Basic Time Series Manipulation with Pandas
    • An Introduction to Scikit Learn: The Gold Standard of Python Machine Learning
    • Mapping with Matplotlib, Pandas, Geopandas and Basemap in Python